Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

31.3 Knowledge Representation

Most obviously, knowledge representation is a medium of human expression, typically a language. In bioinformatics, the representation should be chosen to assist computation; for example, the attributes of an object being optimized using evolutionary computation (Sect. 4.3) have to be encoded in the (artificial) chromosome. With binary encoding, it may be sufficient to represent the presence of an attribute by “1” and its absence by “0”.
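The binary encoding just described can be sketched in a few lines of Python. This is a minimal illustration only: the attribute names and the helper functions `encode`, `decode` and `flip` are hypothetical and do not come from any particular evolutionary computation package.

```python
# Hypothetical attribute names, chosen purely for illustration.
ATTRIBUTES = ["helix", "sheet", "signal_peptide", "transmembrane"]


def encode(present):
    """Return a binary chromosome: 1 if the attribute is present, else 0."""
    return [1 if name in present else 0 for name in ATTRIBUTES]


def decode(chromosome):
    """Recover the set of attributes whose bit is set to 1."""
    return {name for name, bit in zip(ATTRIBUTES, chromosome) if bit == 1}


def flip(chromosome, position):
    """Point mutation: toggle the presence or absence of one attribute."""
    mutated = list(chromosome)  # copy, so the parent is left unchanged
    mutated[position] = 1 - mutated[position]
    return mutated


chromosome = encode({"helix", "transmembrane"})
print(chromosome)                            # [1, 0, 0, 1]
print(sorted(decode(flip(chromosome, 1))))   # ['helix', 'sheet', 'transmembrane']
```

The fixed order of `ATTRIBUTES` is what gives each bit its meaning; a mutation or crossover operator can then act on the list of bits without knowing anything about the attributes themselves.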

Ideally, the representation should provide a guide to the organization of information; indeed, knowledge might be defined as “organized (structured) information”. The ontologies discussed in the previous section are an attempt to represent knowledge in this spirit. The most desirable kind of organization is that which facilitates making inductive inferences, and this will be most successfully achieved if as few preconceptions as possible are imposed on the organization.

Powerful ways of representing knowledge need not involve words, or symbolic strings, at all. Visualization (cf. Sect. 13.4) may be much more revealing than a verbal description. A particular advantage is the possibility of rearranging materials in two dimensions rather than in one. In this regard, languages based on ideographs, most notably Chinese, would appear to be very powerful, since concepts can be rearranged on a sheet of paper and novel juxtapositions can be freely generated.

As knowledge becomes ever more complex, good examples being the organization of living organisms (Fig. 14.1) and their regulation (e.g., Fig. ??), novel ways of representing it need to be creatively explored. One approach that may prove useful is to represent knowledge as probability distributions, conditional upon more or less certain facts emanating from observations or laboratory experiments; as more data become available, inferences can then be continuously updated in a far more systematic manner than is done at present.
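As one possible illustration of knowledge held as a probability distribution and updated as data arrive, consider a Beta distribution over the unknown probability that some binary outcome occurs (say, that a gene is found to be expressed in a given experiment). The scenario, the uniform prior and the counts below are assumptions made for the sake of the sketch, not results from any study.

```python
from fractions import Fraction


def update(alpha, beta, successes, failures):
    """Conjugate Bayesian update of a Beta(alpha, beta) distribution
    after observing Bernoulli outcomes (successes and failures)."""
    return alpha + successes, beta + failures


def mean(alpha, beta):
    """Mean of Beta(alpha, beta): the current point estimate."""
    return Fraction(alpha, alpha + beta)


# Assumed starting point: Beta(1, 1), i.e. a uniform prior.
alpha, beta = 1, 1

# Invented batches of (successes, failures) from successive experiments.
batches = [(3, 1), (5, 1)]

for successes, failures in batches:
    alpha, beta = update(alpha, beta, successes, failures)
    print(f"Beta({alpha}, {beta}), mean = {float(mean(alpha, beta)):.3f}")
```

Each batch simply increments the two parameters, so the distribution after any number of experiments summarizes everything observed so far, and the next batch can be absorbed without revisiting the earlier data.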

31.4 Data Mining

The goal of data mining is usually stated as finding meaningful new patterns in a mass of more or less unstructured data (the ore in the mining analogy, a great part of which will be discarded as gangue). In a nutshell, it is the process of analysing large datasets to discover patterns and insights, applying algorithms and statistical methods to identify relationships and correlations between variables. It is hoped that data mining can uncover trends unperceived by a human observer; hence, it is sometimes called knowledge discovery in databases (KDD).
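A very simple instance of identifying correlations between variables is to screen every pair of variables for a strong Pearson correlation. The sketch below does this in plain Python; the gene names, the expression values and the threshold of 0.9 are invented for illustration, and the helper `correlated_pairs` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(table, threshold=0.9):
    """Return (name_a, name_b, r) for every pair of variables with |r| >= threshold."""
    hits = []
    for (name_a, xs), (name_b, ys) in combinations(table.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            hits.append((name_a, name_b, r))
    return hits


# Invented toy data: three variables measured in five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}

for name_a, name_b, r in correlated_pairs(expression):
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

With these toy values only the pair geneA and geneB passes the threshold. A pair flagged in this way is merely a candidate pattern; whether it is meaningful is a separate question that the screen itself cannot answer.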

The primary motivation is the vast accumulation of data from high-throughput technologies, including nucleic acid sequencing and microarrays. There is an underlying notion that “knowledge” or “meaning” can be self-revealing; depending on the definitions of these terms (cf. Chap. 6) this goal may be illusory, much like the notion of